Configurable neural speech synthesis

ABSTRACT

A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous-scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include natural voice characteristics, accents, and attitudes, among others. Training can be done by transfer learning from an existing neural speech synthesis model, or such a model can be trained with a loss function that considers speech and parameter values. A graphical user interface can allow voice designers for products to synthesize speech with a desired voice or generate a speech synthesis engine with frozen voice parameters. A vector of parameters can be used for comparison to previously registered voices in databases such as ones for trademark registration.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation application of U.S. Non-Provisional patent application Ser. No. 17/341,082, filed Jun. 7, 2021, which claims the benefit of U.S. Provisional Patent Application No. 62/705,127, entitled “CONFIGURABLE NEURAL SPEECH SYNTHESIS,” filed Jun. 12, 2020; which is incorporated herein by reference for all purposes.

BACKGROUND

As people are increasingly utilizing a variety of computing devices, including portable devices such as tablet computers and smart phones, it can be advantageous to adapt the ways in which people interact with these devices. For example, different voice data may be desirable for a variety of applications. In an example, it may be desirable to generate text-to-speech (TTS) voices for video game characters to provide a more interactive and immersive gaming experience. In another example, a user may desire a TTS voice that represents their qualities, such as gender, age, regional accent, etc. However, conventional TTS voices for speech synthesis, using, e.g., concatenative or other approaches, are trained on a single speaker. As such, the playback sound is configurable only along typical digital signal processing (DSP) parameters such as pitch and speed. As a result, machines using a voice sound the same or, for machines to have unique-sounding voices, a large or expensive effort is required to collect training data. This is often not practical for voice-enabling large numbers of diverse devices, including ones from small companies or developers with financial or time-to-market constraints. Accordingly, it is desirable to provide improved techniques for text-to-speech.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings illustrate several embodiments and, together with the description, serve to explain the principles of the invention according to the embodiments. It will be appreciated by one skilled in the art that the particular arrangements illustrated in the drawings are merely exemplary and are not to be considered as limiting of the scope of the invention or the claims herein in any way.

FIG. 1 illustrates an example of a user receiving synthesized speech audio in accordance with embodiments herein;

FIG. 2 illustrates an example environment in which aspects of the various embodiments can be utilized;

FIG. 3A illustrates a configurable speech synthesis model in accordance with various embodiments;

FIG. 3B illustrates a speech audio waveform according to an embodiment;

FIG. 4A illustrates a configurable speech synthesis model in accordance with an alternate embodiment;

FIG. 4B illustrates a speech audio spectrogram according to an embodiment;

FIG. 5 illustrates an example process for training a voice property discriminator in accordance with various embodiments;

FIG. 6 illustrates an example process for training a speech synthesis model on transcribed speech in accordance with various embodiments;

FIG. 7 illustrates an example process for training a configurable speech synthesis model in accordance with various embodiments;

FIG. 8 illustrates an example process for jointly training a speech synthesis model on discriminated voice property values and transcribed speech in accordance with various embodiments;

FIG. 9 illustrates an example configurable neural speech synthesis model trained on multiple voice properties in accordance with various embodiments;

FIG. 10 illustrates an example interface for configuring and synthesizing speech audio in accordance with various embodiments;

FIG. 11 illustrates an example interface for configuring and generating a speech synthesizer in accordance with various embodiments;

FIG. 12 illustrates an example process for ensuring distinct voices for brands in accordance with various embodiments;

FIG. 13 illustrates an example process for examining trademark registration applications in accordance with various embodiments;

FIG. 14A illustrates an example process for training a speech synthesis model in accordance with various embodiments;

FIG. 14B illustrates an example process for generating synthesized speech in accordance with various embodiments;

FIG. 14C illustrates an example process for configuring a speech synthesizer in accordance with various embodiments;

FIG. 15A illustrates an example non-transitory computer readable medium in which aspects of the various embodiments can be utilized;

FIG. 15B illustrates an example non-transitory computer readable medium in which aspects of the various embodiments can be utilized;

FIG. 16A illustrates an example rack-mounted server computer in which aspects of the various embodiments can be utilized; and

FIG. 16B illustrates an example diagram of a server computer system in which aspects of the various embodiments can be utilized.

DETAILED DESCRIPTION

Systems and methods in accordance with various embodiments of the present disclosure may overcome one or more of the aforementioned and other deficiencies experienced in conventional approaches to speech synthesis. In particular, various embodiments described herein provide for configurable neural speech synthesis that may be used separately or in combinations within devices, systems, processes, and methods.

In an embodiment, one example includes a computerized process of training a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. In this example, source samples of speech audio (e.g., voice data from an individual such as a voice donor or machine-generated voice data from a TTS system or other audio generation system) are obtained. The source samples are labeled with discrete values of a voice property, including, for example, a gender voice property, an age voice property, an accent voice property, or a timbre voice property. Other voice properties may indicate the attitude of the speaker, such as whether the speaker appears happy, sad, calm, excited, formal, casual, etc.

A discriminator is trained from the source samples and labels. The discriminator is configured to generate a probability value that quantifies the likelihood of the voice property from a sample of speech audio.

A model (e.g., a neural speech synthesis model or synthesis model) is trained by synthesizing a multiplicity of synthesized speech samples using the model with a diverse set of voice property values. Corresponding probabilities are generated for the synthesized speech samples using the discriminator. A property-learning weight adjustment is generated by back-propagating changes to minimize a loss function that depends on the difference between the voice property values and the corresponding probabilities.

In certain embodiments, synthesizing the multiplicity of synthesized speech samples uses a transcription of source samples, and the process further comprises computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech. Such a process allows for the simultaneous training of the neural speech synthesis model for the conversion of text to speech and the ability to provide different voice sounds. The process also prevents the synthesis model from learning to generate an undesirable output or other output signal that satisfies the discriminator but is inauthentic or otherwise undesirable (e.g., an output that does not sound like real or expected speech). Simultaneous training is an alternative to first training a general speech synthesis model and then augmenting the training to be able to create variations in voices.

Thereafter, in response to receiving a string of text and at least one voice property value at the model (e.g., the neural speech synthesis model or synthesis model), the model evaluates the string of text and the voice property value to convert the text to speech audio in a voice based on the voice property value. Stated another way, the model synthesizes speech audio corresponding to the text based on the voice property value. The at least one voice property can be one that is meaningful to a user, such as gender. This allows a user to quickly and easily try different voice sounds and thereby find a voice that meets the needs of their product or use. Further, it allows for saving the property values and comparing them to others to ensure that they are different enough that different products' voices will be distinct. For example, users can adjust the sound of the synthesized voice by making it more male or younger or having a stronger Texas accent. Such configurability has the benefit of enabling rapid experimentation and testing of voices that can affect the perception and relatability of machines that employ speech synthesis as configured.

Instructions for causing a computer system to configure a speech synthesizer in accordance with the present disclosure may be embodied on a computer-readable medium. For example, in accordance with an embodiment, a backend system can receive at least one voice property value. The backend system can generate code for execution by a computer, the code implementing a neural network wherein a node in a hidden layer includes, in its summation, a constant term derived from the product of the voice property value and a weight learned from a training process. The backend system can output the code, wherein the code implements a speech synthesis function within the speech synthesizer.

Embodiments provide a variety of advantages. For example, in accordance with various embodiments, computer-based approaches for configuring a speech synthesizer can be utilized by content providers, device manufacturers, etc., and consumers of the content providers and device manufacturers. The speech synthesizer systems and approaches can improve the operation and performance of the computing devices on which they are implemented by, among other advantages, generating computer code for a speech synthesizer in which the TTS voice is frozen as configured by the at least one voice property value. This allows for creating embedded system devices or other systems that have a specific voice. Such systems can integrate the computer code in a modular way that simplifies the design of such systems. Further, it becomes impractical to change the voice, such that once a user chooses and pays for a voice, they cannot change it without performing the method a second time.

The speech synthesizer system and approaches can be used by computer-based techniques to optimize resource utilization of various resources, for example, by generating code in a binary format. This improves modularity and further frustrates attempts at reverse engineering or changing the sound of the synthesized voice.

Further, because the voice property value may constitute a voice property vector, the speech synthesizer system and approaches allow for reading at least one stored voice property vector from a brand database and computing a distance between the stored voice property vector and the received property vector. This advantageously allows for a measurable comparison of the similarity of any two voices. For example, in response to the computed distance being closer than a threshold distance, an error message can be generated, which can be used to alert and/or prevent users from configuring a voice that is too similar to another voice. This avoids having different products in the marketplace with voices so similar that users of the products could be confused about which one is producing synthesized voices. In another example, in response to the computed distance being farther than a threshold distance, the received property vector can be stored in the brand database. This allows for creating a database that is useful for comparing to future voice configurations to ensure branded voice differentiation.

Further still, the speech synthesizer system and approaches allow for examining trademarks. For example, the speech synthesizer system and approaches comprise receiving a specimen of speech audio with an application for a trademark registration; applying a discriminator of a plurality of voice property values to the specimen to compute a voice property vector; computing distances between the computed voice property vector and other voice property vectors stored in a database; and determining allowability of the application in dependence upon the smallest computed distance being greater than a threshold. Such an approach enables a government to examine voice trademark registration applications quickly and effectively to allow registrants to prevent the use of synthesized voices that could cause confusion as to the source of goods and services.

Various other functions and advantages are described and suggested below as may be provided in accordance with the various embodiments.

FIG. 1 illustrates an example situation 100 wherein a user 102 is interacting with a computing device 104. More specifically, computing device 104 is providing synthesized speech 106 for a gaming character to provide a more interactive and immersive gaming experience. Although a portable computing device (e.g., a smart phone, an electronic book reader, or tablet computer) is shown, it should be understood that various other types of electronic devices that are capable of determining and processing input can be used in accordance with various embodiments discussed herein. These devices can include, for example, televisions, notebook computers, personal data assistants, video gaming consoles or controllers, portable media players, and wearable computers (e.g., smart watches, smart glasses, etc.), among others. The computing device 104 includes a speaker to play audio including, for example, voice or speech data. The device can render an interface, such as an application interface, that can present content. The content can include text, images, video, audio, etc.

As described, speech synthesis is starting to become commonplace in computers, smartphones, and embedded systems such as smart speakers, robots, automobiles, mobile, portable, and wearable devices, computer terminal interfaces, telephone interactive voice response systems, public address systems, and others.

Certain companies and brands have invested in creating identifiable and sometimes trademarked sounds. Examples include the roar of the lion at the beginning of Metro Goldwyn Mayer movies, the sound of a lightsaber in Star Wars, the jingle of T-Mobile phones, the DaDaDa DaDaDa sound of the ESPN sports entertainment network, the bloop of a TiVo remote control operation, and Homer Simpson's D'oh annoyed grunt. Huge variations of human voices are possible, and yet some are clearly identifiable. For example, many people can recognize the voices of James Earl Jones, Jack Nicholson, or Kathleen Turner even without seeing their image.

As ever more different systems synthesize speech, it is increasingly common for different systems to have similar-sounding voices, which is undesirable in part because it can create confusion among users and in part because it means that the systems associated with brands do not have a unique identity. Though synthesized speech can say essentially any words, people can recognize the sound of a voice no matter what words it says. To create recognizable brands, makers of voice-enabled systems desire for their systems to have voices that are both distinctive and have certain properties. It is also desirable for the providers of neural speech synthesis and related technologies to be able to provide such unique voices.

Voice designers want to be able to configure the voices by making changes and adjustments in ways that they expect. For example, they might want a voice that sounds a little bit younger or a little bit more like it has a New York accent. In another example, it may be desirable for user 102 to interact with game characters having different and varying voices. In this way, in an embodiment, a speech synthesis system should take as input voice property values along dimensions that are perceptibly meaningful, such as gender, age, and accent.

Accordingly, various embodiments provide for configurable neural speech synthesis, which uses parametric speech synthesis with a neural network architecture to generate speech audio features. Configurable neural speech synthesis may be configurable by parameters, the values (e.g., gender, age, and accent) of which relate to voice properties in a way that has perceptible meaning. In an embodiment, TTS voice properties include natural voice characteristics, accents, and attitudes. Voice characteristics relate to physiological attributes of a voice, such as ones that vary distinguishably between gender and age. Accent relates to learned ways of producing phonemes, such as the variations between regions and ethnicities. Attitudes relate to feelings such as happiness, calmness, and formalness.

This is in contrast to voices defined by voice embeddings in a machine-learned space such as x-vectors. The combined configurable range of each voice property parameter enables the speech synthesizer to synthesize a wide range of human-sounding voices. Furthermore, configurable neural speech synthesis may be language-specific or universal.

In various embodiments, beyond merely configuring voice properties as input parameters to speech synthesis, tags within the text to synthesize, in a format such as Speech Synthesis Markup Language (SSML), can indicate dynamic voice parameter values along dimensions learned by a neural network.
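For illustration, such tagged input text might look like the following sketch. The <prosody> element is standard SSML, while the <voice-property> tag is a hypothetical extension invented here only to show how a dynamic voice parameter value could be embedded in the text.

```python
# Hypothetical SSML-style input text; the <voice-property> tag is an
# illustrative extension for dynamic voice parameter values, not part
# of the SSML standard.
ssml_text = """
<speak>
  Welcome back.
  <voice-property name="texasness" value="0.8">
    Howdy, partner!
  </voice-property>
  <prosody rate="slow">Please drive safely.</prosody>
</speak>
"""
```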

FIG. 2 illustrates an example environment 200 in which aspects of the various embodiments can be implemented. It should be understood that reference numbers are carried over between figures for similar components for purposes of simplicity of explanation, but such usage should not be construed as a limitation on the various embodiments unless otherwise stated. In this example, a user can utilize a client device 202 to communicate across at least one network 204 with a resource provider environment 206. The client device 202 can include any appropriate electronic device operable to send and receive requests or other such information over an appropriate network and convey information back to a user of the device. Examples of such client devices 202 include personal computers, tablet computers, smartphones, notebook computers, and the like. The user can include a person authorized to manage the aspects of the resource provider environment 206.

The resource provider environment 206 can provide speech synthesis services. These services can, for example, train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. This allows a user to quickly and easily try different voice sounds and thereby find a voice that meets the needs of their product or use. Further, it allows for saving the property values and comparing them to others to ensure that they are different enough that different products' voices will be distinct. In various embodiments, the speech synthesis services can be performed in hardware, software, or a combination thereof.

The network(s) 204 can include any appropriate network, including an intranet, the Internet, a cellular network, a local area network (LAN), or any other such network or combination, and communication over the network can be enabled via wired and/or wireless connections.

The resource provider environment 206 can include any appropriate components for training a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property, receiving speech data, presenting interfaces, etc. It should be noted that although the techniques described herein may be used for a wide variety of applications, for clarity of presentation, examples relate to speech synthesizing applications. The techniques described herein, however, are not limited to speech synthesizing applications, and approaches may be applied to other situations where managing voice data is desirable, such as creating voice banks, verifying voice data, trademarks, etc.

The resource provider environment 206 might include Web servers and/or application servers for obtaining and processing voice data to train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. While this example is discussed with respect to the internet, web services, and internet-based technology, it should be understood that aspects of the various embodiments can be used with any appropriate services available or offered over a network in an electronic environment, or devices otherwise not connected or intermittently connected to the internet.

In various embodiments, resource provider environment 206 may include various types of resources 214 that can be used to facilitate speech synthesis services. The resources can include, for example, custom voice system 222, voice training system 224, application servers operable to process instructions provided by a user, or database servers operable to process data stored in one or more data stores 216 in response to a user request.

Custom voice system 222 is operable to receive a string of text and at least one voice property value. Custom voice system 222 evaluates the string of text and the voice property value to convert the text to speech audio in a voice based on the voice property value. Custom voice system 222 is described in greater detail below.

Voice training system 224 is operable to train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. For example, source samples of speech audio (e.g., voice data from an individual such as a voice donor or machine-generated voice data from a TTS system or other audio generation system) are obtained and the source samples are labeled with discrete values of a voice property. Voice training system 224 trains a discriminator from the source samples and labels. Voice training system 224 trains a model (e.g., a neural speech synthesis model or synthesis model) by synthesizing a multiplicity of synthesized speech samples using the model with a diverse set of voice property values. Corresponding probabilities are generated for the synthesized speech samples using the discriminator. Voice training system 224 computes a property-learning weight adjustment by back-propagating changes to minimize a loss function that depends on the difference between the voice property values and the corresponding probabilities.

In at least some embodiments, an application executing on the client device 202 that needs to access resources of the provider environment 206, for example, to initiate an instance of custom voice system 222, can submit a request that is received at interface layer 208 of the provider environment 206. The interface layer 208 can include application programming interfaces (APIs) or other exposed interfaces, enabling a user to submit requests, such as Web service requests, to the provider environment 206. Interface layer 208 in this example can also include other components as well, such as at least one Web server, routing components, load balancers, and the like.

When a request to access a resource is received at the interface layer 208, in some embodiments, information for the request can be directed to resource manager 210 or another such system, service, or component configured to manage user accounts and information, resource provisioning and usage, and other such aspects. Resource manager 210 can perform tasks such as communicating the request to a management component or other control component which can be used to manage one or more instances of a custom voice system, as well as other information for host machines, servers, or other such computing devices or assets in a network environment; authenticating an identity of the user submitting the request; and determining whether that user has an existing account with the resource provider, where the account data may be stored in at least one data store 212 in the resource provider environment 206. For example, the request can be used to instantiate custom voice system 222 on host machine 230.

It should be noted that although host device 230 is shown outside the provider environment, in accordance with various embodiments, one or more components of custom voice system 222 can be included in provider environment 206, while in other embodiments, some of the components may reside outside the provider environment. It should be further noted that host machine 230 can include or at least be in communication with other components, for example, content training and classification systems, image analysis systems, audio analysis systems, etc.

The various computing devices described herein are exemplary and for illustration purposes only. The system may be reorganized or consolidated, as understood by a person of ordinary skill in the art, to perform the same tasks on one or more other servers or computing devices without departing from the scope of the invention. The resources may be hosted on multiple server computers and/or distributed across multiple systems. Additionally, the components may be implemented using any number of different computers and/or systems. Thus, the components may be separated into multiple services and/or over multiple different systems to perform the functionality described herein. In some embodiments, at least a portion of the resources can be “virtual” resources supported by these and/or other components.

One or more links couple one or more systems, engines, or devices to the network 204. In particular embodiments, each of the one or more links includes one or more wired, wireless, or optical links. In particular embodiments, each of the one or more links includes an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a MAN, a portion of the Internet, or another link or a combination of two or more such links. The present disclosure contemplates any suitable links coupling one or more systems, engines, or devices to the network 204.

In particular embodiments, each system or engine may be a unitary server or may be a distributed server spanning multiple computers or multiple datacenters. Systems may be of various types, such as, for example and without limitation, a web server, advertising server, file server, application server, or proxy server. In particular embodiments, each system may include hardware, software, or embedded logic components, or a combination of two or more such components, for carrying out the appropriate functionalities implemented or supported by their respective servers. For example, a web server is generally capable of hosting websites containing web pages or particular elements of web pages. More specifically, a web server may host HTML files or other file types, or may dynamically create or constitute files upon a request and communicate them to client devices or other devices in response to HTTP or other requests from client devices or other devices.

In particular embodiments, one or more data storages may be communicatively linked to one or more servers via one or more links. In particular embodiments, data storages may be used to store various types of information. In particular embodiments, the information stored in data storages may be organized according to specific data structures. In particular embodiments, each data storage may be a relational database. Particular embodiments may provide interfaces that enable servers or clients to manage, e.g., retrieve, modify, add, or delete, the information stored in data storage.

The system may also contain other subsystems and databases, which are not illustrated in FIG. 2, but would be readily apparent to a person of ordinary skill in the art. For example, the system may include databases for storing data, storing features, storing outcomes (training sets), and storing models. Other databases and systems may be added or subtracted, as would be readily understood by a person of ordinary skill in the art, without departing from the scope of the invention.

Training

Configurable neural speech synthesis uses a generative neural network that is a product of a training process, such as one implemented using voice training system 224. Multiple approaches to training are possible, and some examples are described below and can be utilized in voice training system 224. Some examples of a training process use supervised or semi-supervised learning, which requires samples of speech labeled according to discrete values of a voice property. Training labels are discrete values such as Booleans or enumerated types. Some examples of types of labels for training samples include child or not, male or female, one of several languages, one of several regional accents of a language such as New York, Texas, or China, timbre such as nasal, bright, or croaky, happy or sad, calm or excited, and formal or casual. Limiting the possible values of labels makes it easier for humans to label training samples at an acceptable rate. Asking human labelers to listen to speech recordings and estimate values on a continuous scale would slow labeling down.
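As a concrete illustration, a labeled training sample might be represented as in the following Python sketch; the field names and enumeration values are hypothetical, chosen only to show Boolean and enumerated label types.

```python
from dataclasses import dataclass
from enum import Enum

class Accent(Enum):
    GENERAL = 0
    NEW_YORK = 1
    TEXAS = 2

@dataclass
class LabeledSample:
    audio_path: str   # path to the source speech recording
    is_child: bool    # Boolean label: child speaker or not
    is_female: bool   # Boolean label: male or female
    accent: Accent    # enumerated label: regional accent
    is_happy: bool    # Boolean attitude label: happy or sad

# A hypothetical labeled source sample
sample = LabeledSample("donor_0001.wav", is_child=False,
                       is_female=True, accent=Accent.NEW_YORK,
                       is_happy=True)
```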

Inference

A model capable of inferring probabilities of properties of certain input samples is both a part of training and a result of training a configurable neural speech synthesis model. FIG. 3A illustrates an exemplary embodiment 300 of the custom voice system 222 in accordance with various embodiments. Custom voice system 222 can be implemented using software and/or hardware. It should be noted that the components of custom voice system 222 may be distributed among multiple server computers. For example, some servers could implement data collection and other servers could implement TTS voice synthesis. Further, some of these operations could be performed by other computers as described herein.

In this example, custom voice system 222 can include ingestion component 302, voice synthesis engine 306, text data store 304, and voice property value data store 308. Voice synthesis engine 306 can include configurable neural speech synthesis inference model 310.

Ingestion component 302 is operable to obtain text data and user preference data (e.g., voice property value data) from various sources via an interface. Sources may include one or more content providers. Content providers can include, for example, users, movie agencies, broadcast companies, cable companies, internet companies, game companies, vending and retail services companies, music and video distribution companies, government agencies, automobile companies, etc. In an embodiment, once the sources are identified, a variety of methodologies may be used to retrieve the relevant media data via the interface, including but not limited to, data scrapes, API access, etc. The text data may be stored in text data store 304 and the voice property value data may be stored in voice property value data store 308.

In an embodiment, the interface may include a data interface and a service interface that may be configured to periodically receive text, voice property value data, and/or other data. The interface can include any appropriate components known or used to receive requests or other data from across a network, such as may include one or more application programming interfaces (APIs) or other such interfaces for receiving such requests and/or data.

Configurable neural speech synthesis inference model 310 is capable of inferring probabilities of properties of certain input samples and is both a part of training and a result of training a configurable neural speech synthesis model. For example, configurable neural speech synthesis inference model 310 is operable to receive input text and one or more voice property values and generate synthesized speech audio as an output. The output can be stored in synthesized audio data store 312 or other appropriate data store, and/or otherwise utilized. FIG. 3B illustrates example 320 of an audio wave 322 of synthesized voice audio output from voice synthesis engine 306.

In an embodiment, some neural speech synthesis models may use more than one internal neural network. For example, one may be trained to produce an audio spectrogram, and another uses the spectrogram to produce a waveform. Other ways of dividing the work of speech synthesis between different neural and expert-designed models are possible. FIG. 4A illustrates an exemplary embodiment 400 of the custom voice system 222 showing additional components in accordance with various embodiments. In this example, custom voice system 222 represents an example two-piece inference model for configuring neural speech synthesis and includes feature model 402 and vocoder 404. In an embodiment, high-level feature model 402 takes as input text to be converted to speech audio and one or more voice property values. It produces a spectrogram of speech as output. A vocoder 404 takes as input the spectrogram and produces synthesized speech audio as an output that can be stored in synthesized audio data store 312 or other appropriate data store, and/or otherwise utilized. FIG. 4B illustrates example 420 of spectrogram 422 of speech audio produced by high-level feature model 402 and used as input to a vocoder 404.

Discriminator

In an embodiment, one example of neural speech synthesis uses a discriminator as part of the training process. The discriminator takes in an audio sample sourced from a corpus of training audio samples and computes a probability of it being associated with one or more specific labels. In some examples, the discriminator is a model trained using machine learning, such as a neural network; supervised or semi-supervised training is possible. It is also possible to use an expert-designed model that is not trained from data.

FIG. 5 illustrates example 500 of a system of training a discriminator neural network model 504. In this example, the training includes a process of obtaining source samples of speech audio (e.g., voice data from an individual such as a voice donor or machine-generated voice data from a TTS system or other audio generation system) labeled with discrete values of one or more voice properties (not shown). An initial discriminator model 504 processes the source samples to compute a probability for one or more of the voice properties. A training process 502 compares the computed probability with the actual property label associated with the source sample using a loss function represented as:

loss = probability of property − Boolean property label   Eq. (1)

It should be noted that other loss functions are possible, such as ones that sum the loss of multiple properties. Such sums could be weighted based on the relative importance of each property. Other mathematical functions in the loss function may be appropriate for specific system constraints.

The training process 502 proceeds to compute error gradients for parameters of the discriminator neural network. It is not strictly necessary to compute a gradient for each parameter. The training process 502 proceeds to apply adjustments to the weights of the discriminator model 504 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible.
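A minimal sketch of such a training step follows, assuming PyTorch, a fixed-size audio feature vector as input, and independent sigmoid outputs for each property; the architecture and feature dimensions are illustrative placeholders, not the method's required form.

```python
import torch
import torch.nn as nn

# Minimal discriminator: maps a fixed-size audio feature vector to
# per-property probabilities via independent sigmoid outputs.
class Discriminator(nn.Module):
    def __init__(self, n_features=128, n_properties=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, n_properties), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)  # probabilities in the range 0 to 1

disc = Discriminator()
optimizer = torch.optim.Adam(disc.parameters(), lr=1e-3)

def train_step(features, labels):
    """features: (batch, 128) audio features; labels: (batch, 3) in {0, 1}."""
    probs = disc(features)
    # Absolute difference between probability and Boolean label, in the
    # spirit of Eq. (1); binary cross-entropy is a common alternative.
    loss = torch.abs(probs - labels).mean()
    optimizer.zero_grad()
    loss.backward()   # compute error gradients for the weights
    optimizer.step()  # apply adjustments scaled by the learning rate
    return loss.item()
```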

Different source samples will produce different probabilities within the range of 0 to 1. A usefully trained discriminator tends to produce output values spread across the range between 0 and 1, advantageously providing diversity of output probabilities. For example, if diversity is low, some experimentation with removing a softmax output or having independent sigmoid outputs for different properties can be helpful. Limiting the amount of training, and therefore the prediction certainty, can also be helpful. The requirements may be application specific.

Transfer Training

A trained neural speech synthesis model can be a baseline model, which can be adapted to vary based on parameter input values as expected by users. Training neural speech synthesis models, such as Tacotron and its progeny, can use a loss function that compares model output to source training samples. This can be done, for example, by comparing spectrograms with a loss function such as one represented by:

loss = sum over bins(abs(recording spectrogram bin − speech spectrogram bin))   Eq. (2)

Mean squared error or other alternatives to an absolute value are appropriate for some models and applications.

FIG. 6 illustrates example 600 of training a baseline neural speech synthesis model 604 in accordance with various embodiments. In this example, the speech synthesis model 604 takes in transcriptions of training speech audio samples and produces synthesized speech audio as output. A training process 602 compares the synthesized speech with the source training audio sample using the loss function above. Other loss functions are possible and appropriate for other applications.

In an embodiment, the training process 602 proceeds to compute an error gradient for each parameter of the speech synthesis model 604. In certain embodiments, gradients are computed only for selected parameters. The training process 602 proceeds to apply adjustments to the weights of the speech synthesis model 604 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. In certain embodiments, the factor is dynamic. For example, the factor can be based on one or more performance metrics. Various other machine learning techniques for training neural networks are possible in accordance with embodiments described herein.
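As a sketch of the loss in Eq. (2), assuming both spectrograms are PyTorch tensors of shape (frames, bins):

```python
import torch

def spectrogram_loss(recording_spec, synthesized_spec):
    """Eq. (2): sum over bins of the absolute difference between the
    recorded and synthesized spectrograms; both (frames, bins)."""
    return torch.sum(torch.abs(recording_spec - synthesized_spec))

def spectrogram_mse(recording_spec, synthesized_spec):
    """Mean squared error, an alternative for some models."""
    return torch.mean((recording_spec - synthesized_spec) ** 2)
```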

A pre-trained baseline speech synthesis model generates a particular voice for the speech that it synthesizes. For example, a target voice with a general accent, middle to young age, and neutral-sounding gender may be preferred. After having pre-trained a baseline speech synthesis model, it is possible to perform transfer training by training an improved speech synthesis model that has one or more additional input nodes to the neural network, the nodes indicating voice property values. This can enable the speech synthesis model to learn how to adapt the sound of the synthesized voice according to the voice property values.

For example, FIG. 7 illustrates example 700 of transfer training in accordance with various embodiments. In this example, a configurable speech synthesis model 704 takes as input a voice property value and text. Voice property values may be chosen in a pattern or randomly. In an embodiment, the voice property values can be compared to a diversity threshold to generate a diverse set of voice property values. The speech synthesis model 704 outputs synthesized speech audio. A discriminator 710, trained as described above and shown in FIG. 5, obtains the synthesized speech audio and computes a probability. A training process 708 compares the probability computed by the discriminator 710 to the voice property value using a loss function represented by:

loss = probability of property − voice property value   Eq. (3)

This has an effect equivalent to minimizing the cross-entropy loss between two models, where, effectively, the output of one of the models is defined by the voice property values. It should be noted that other loss functions are possible in accordance with embodiments described herein. The training process 708 proceeds to compute an error gradient for parameters of speech synthesis model 704. For example, training process 708 computes an error gradient for one or more parameters. Training process 708 proceeds to apply adjustments to the weights of the speech synthesis model 704 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible in accordance with various embodiments.
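The following is a minimal sketch of one transfer-training step, under the assumptions that `synth_model` and `discriminator` are differentiable PyTorch modules, that the discriminator operates directly on the synthesized output, and that the optimizer holds only the synthesis model's parameters so the discriminator's weights remain fixed.

```python
import torch

def transfer_train_step(synth_model, discriminator, optimizer, text_batch):
    # Draw diverse voice property values on a continuous scale in [0, 1]
    target_values = torch.rand(text_batch.size(0), 1)
    audio = synth_model(text_batch, target_values)
    probs = discriminator(audio)
    # Eq. (3): difference between the discriminator's probability and the
    # requested voice property value (absolute value for a usable loss)
    loss = torch.abs(probs - target_values).mean()
    optimizer.zero_grad()
    loss.backward()   # gradients flow back through the discriminator,
    optimizer.step()  # but only the synthesis model's weights are updated
    return loss.item()
```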

Joint Training

Rather than pre-training a baseline neural speech synthesis model and using transfer training to turn it into a configurable neural speech synthesis model, it is possible to train a model jointly to simultaneously learn speech synthesis in general and configurability according to voice parameters.

FIG. 8 illustrates example 800 of jointly training configurable neural speech synthesis. A configurable neural speech synthesis model 804 takes in transcriptions of source training audio samples and one or more voice property values on a continuous scale. In certain embodiments, voice property values are related to the source samples of speech audio (e.g., voice data from an individual such as a voice donor or machine-generated voice data from a TTS system or other audio generation system). In an embodiment, voice property values are disjoint from the source samples. The configurable neural speech synthesis model 804 outputs synthesized speech.

Training process 802 compares the synthesized speech with the source training audio sample corresponding to the text transcription. Training process 802 proceeds to compute a loss value and/or weight adjustment according to an error gradient for parameters of the speech synthesis model 804.

Discriminator 810, trained as described above in FIG. 5, takes in the synthesized speech audio and computes a probability. Training process 808 compares the probability computed by discriminator 810 to the one or more voice property values. Training process 808 proceeds to compute a loss value and/or weight adjustment according to an error gradient for each parameter of speech synthesis model 804.

A combination 806 of the weight adjustment or computation of weight adjustments from loss values from training process 802 and training process 808 produces a combined weight adjustment according to the loss function represented by:

loss = WS (sum over bins(abs(recording spectrogram bin − speech spectrogram bin))) + WP (probability of property − voice property value)   Eq. (4)

where WS and WP are relative weightings of the effect of training sample voice matching and voice property value matching, respectively. This has the effect of training a synthesis model that can generate sounds according to voice property values while not learning to generate an undesirable output or other output signal that satisfies the voice property values without generating the sounds represented by the input text.
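A sketch of the combined loss of Eq. (4) follows; the tensor shapes and the particular weighting values are illustrative assumptions.

```python
import torch

W_S = 1.0  # weight for training-sample (spectrogram) matching
W_P = 0.5  # weight for voice-property matching; tuned by experiment

def joint_loss(recording_spec, synthesized_spec, probs, property_values):
    """Eq. (4): weighted sum of the source-matching term and the
    property-matching term."""
    source_term = torch.sum(torch.abs(recording_spec - synthesized_spec))
    property_term = torch.sum(torch.abs(probs - property_values))
    return W_S * source_term + W_P * property_term
```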

In an embodiment, during a manual approach, the relative weightings that give the most accuracy per training time can be determined through experimentation. Additionally, or alternatively, the relative weightings can be based on one or more performance metrics or other such factors. The combined weight adjustment is applied to the weights of the speech synthesis model 804 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible.

The result is a speech synthesis model 804 that can take in text and one or more voice property values that the model 804 has learned and produce synthesized speech audio with a voice as defined by a user's setting of the voice property values.

Synthesis Using the Model

In an embodiment, a service of synthesizing speech audio from text and a vector of voice property values for a specific desirable voice is provided. This is useful, for example, to create pre-recorded messages for a telephone service interactive voice response (IVR) menu with menu messages such as “to continue in English, press 1” or “to check your account balance, press 2”. It is also useful for pre-recorded messages in devices such as voice interactive web sites, mobile apps, advertisements, robots, or automobiles with messages such as “opening windows” or “as you wish”. The voice, and its configuration, create a brand identity that users and consumers recognize.

The configuration operations can be provided through an application programming interface (API) that gives user-controlled access to the synthesis operation on a server across a network. The synthesis can be performed directly or locally. An API request or local function call can take as arguments relevant voice parameters such as accent, vocal tract parameters such as deepness, and attitudes such as speed or excitement level.
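For illustration, such an API request might look like the following sketch; the endpoint URL, parameter names, and JSON layout are hypothetical, as any actual service would define its own contract.

```python
import requests

# Hypothetical endpoint and parameter names, for illustration only
response = requests.post(
    "https://api.example.com/v1/synthesize",
    json={
        "text": "Opening windows.",
        "accent": {"new_york": 0.1, "texas": 0.7},
        "vocal_tract": {"deepness": 0.8, "nasal": 0.2},
        "attitude": {"excitement": 0.4, "speed": 0.5},
        "format": "wav",
    },
)
with open("message.wav", "wb") as f:
    f.write(response.content)  # synthesized speech audio
```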

FIG. 9 illustrates example 900 of a functional speech synthesis engine 902. Speech synthesis engine 902 takes, as input, a plurality of inputs including, for example, text of a speech segment to synthesize, an accent parameter value, a vocal tract parameter value, and an attitude parameter value. A request for synthesized speech is received. In response, speech synthesis engine 902 generates an output with speech audio. In an embodiment, the output could be a stream or a file in a format such as wave (WAV), Speex, Free Lossless Audio Codec (FLAC), or MPEG-1 Audio Layer III (MP3).

A user, such as a system engineer, or a higher-level function that calls the speech synthesis engine 902 can then incorporate the audio samples into a product. Providing a configurable speech synthesis service may be part of a company's business model in which they charge money, for example, per-message, as a subscription, per-project, or in a per-unit royalty agreement.

Users may call a speech synthesis function using a command line program such as one in a Linux® shell or a software development environment in Linux®, Macintosh®, or Windows®. It is also possible to provide a web or browser-based graphical user interface (GUI) for system designers to synthesize speech audio with values of configurable speech parameters.

For example, FIG. 10 illustrates example 1000 of a GUI for synthesizing speech audio from text according to configurable parameters in accordance with various embodiments. In this example, the GUI includes a text entry box 1002 for a user to enter text to synthesize. The text can include, for example, tags in the SSML language, such as tags to indicate words to be spoken with emphasis. The GUI includes slider bars 1004 that define parameter values based on how far a graphic of a slider is between its left and right extreme. Conventionally, left is a small value, and right is a large value. It should be noted that any graphic may be employed in accordance with various embodiments, including, for example, a graphic of a knob or dial, a numerical text entry box, or other numerical input methods, or combinations of methods different for different parameters, or multiple controls with different methods synchronized, such as a slider that changes a value in a numerical text entry box.

Sliders in the GUI of FIG. 10 can be labeled. Some are labeled on the left and right to give names to the extreme ends, such as Female and Male. Some sliders have a single label, such as New Yorkness, that indicates an amount of a single type of parameter value. The GUI of FIG. 10 has five sliders that define parameter values for gender, age, amount of a New York type of accent, amount of a Texas type of accent, and prominence of a nasal-sounding vocal tract. A user can independently configure each parameter. Some systems may enforce dependencies between parameters, such as having an increase in the New Yorkness parameter force a corresponding decrease in the Texasness parameter.

After configuring a set of parameters 1004, a user can select a play button 1006 to hear a sample of some or all of the text synthesized into speech audio played from the browser. This allows experimentation with the sound of the voice before committing to a final output audio file. Some systems only synthesize and play a portion or multiple non-contiguous portions of the entered text to make it difficult for a user to capture and save the playback sample without paying for the custom-configured synthesized audio.

After a user is satisfied with the sound of the voice that they have configured, they may select a button 1008 to download a file with the synthesized speech audio of their input text. In an embodiment, a charge or other consideration may be debited for the download according to some business models.

Configuring a Speech Synthesizer

Some developers of computerized applications and embedded systems such as automobiles, robots, smart speakers, appliances, and servers provide voice interfaces for such systems that require an ability to generate speech audio for essentially any words at essentially any time that it is needed to provide a user experience. To provide a desired brand voice, such systems can utilize a speech synthesis engine configured for their specific voice but not configurable for any other voice. In other words, a speech synthesis engine that is locked to a custom voice configuration is “frozen” with locked voice property values. A frozen or locked voice property value is a voice property value that remains the same or constant. Speech synthesis technology providers can support that by providing speech synthesis engines generated with selected voice property parameter values and configured by a configurator interface.

A configurator can be provided through an application programming interface (API), a software development kit (SDK), or similar methods. The configurator can provide user-controlled access to the synthesis operation on a server across a network. In certain embodiments, the configurator can be provided directly or locally. An API request or local function call may take as arguments relevant voice parameters such as accent, vocal tract parameters such as deepness, and attitudes such as speed or excitement level.

A user, such as a system engineer, or a higher-level function can then incorporate the generated speech synthesis engine into a product. In an embodiment, providing a speech synthesis engine configurator service may be part of a company's business model in which they charge money, for example, per-message, as a subscription, per-project, or in a per-unit royalty agreement.

Voice designers may access a configurator using a command line program such as one in a Linux® shell or a software development environment in Linux®, Macintosh®, or Windows®. It is also possible to provide a web or browser-based graphical user interface (GUI) for system designers to configure a speech synthesis engine.

FIG. 11 illustrates example 1100 of a GUI configurator for generating a speech synthesis engine with a voice fixed according to configurable parameters in accordance with various embodiments. In this example, the configurator GUI has the same slider bars 1004 for the same voice parameters and constraints as in the speech synthesizer GUI of FIG. 10. As in FIG. 10, the configurator GUI of FIG. 11 may additionally have a text entry box and play button to assess the sound of a voice configuration. These are not shown in FIG. 11. After a user is satisfied with the sound of the voice that they have configured, they may press a button 1102 to invoke a function that generates the speech synthesis engine and provides it as a file to download. In an embodiment, they may be charged for the download according to some business models.

The speech synthesis engine may be provided as an executable binary, as human-readable programming code in a language such as Python, or as a neural network architecture parameter set for use by standard neural network software. Some generated speech synthesis engines that are delivered as executables or source code may support SSML tags or other dynamic tags to affect the sound of synthesized speech.

Freezing Voice Parameters

After a user requests that the system generate a speech synthesis engine with a frozen set of voice parameters, the method of generation starts by treating the voice parameter values as a set of neural network input features to a neural network trained to be configurable according to the voice values. The system then treats those input values to the network as constants and forward propagates the constants into the hidden layer(s) of the neural network. Whereas the speech synthesis engine 902 of FIG. 9 takes text and voice parameters (accent, vocal tract parameters, and attitude) as input, the text would remain a variable input, but the voice parameters would be constant.

Each node of the first hidden layer comprises an activation function fed by a sum of input parameters multiplied by weights. The weights are learned from the training process of the speech synthesis neural network, such as the processes described in FIG. 6, FIG. 7, and FIG. 8. The voice parameter values are multiplied by their respective parameter weights in the configuration method, added together, and included as a constant bias amount within the node. If hidden layers other than the first have inputs directly from input voice property parameters, they can be configured in the same way.

The result is a neural network comprising one or more inputs for text but no inputs for the frozen voice parameters. The multiplications, additions, and activation functions in appropriate combinations may be provided as human-readable source code in a language such as Python and/or in a framework such as TensorFlow. They may be compiled into an executable. Before the compiling or as part of the compilation process, hardware-architecture-specific optimizations may be performed, such as parallelizing functions to make use of single instruction multiple data (SIMD) instructions within high-performance general-purpose processors and digital signal processing (DSP) processors, or the functions may be divided as appropriate for the processing elements within graphics processing units (GPUs).
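A minimal numpy sketch of this bias-folding step for a single first-hidden-layer weight matrix follows; the layer sizes and the five-parameter voice vector are illustrative assumptions.

```python
import numpy as np

def freeze_voice_parameters(W_voice, b, voice_values):
    """Fold frozen voice-parameter inputs into the layer's bias.

    W_voice:      (hidden, n_voice_params) weights on voice property inputs
    b:            (hidden,) learned bias vector
    voice_values: (n_voice_params,) frozen voice property values

    Returns a new bias with the constant contribution of the voice
    parameters pre-summed, so the generated engine needs no voice
    property inputs; the text-input weights are left unchanged.
    """
    return b + W_voice @ voice_values

# Example with hypothetical sizes: 256 hidden nodes and 5 voice
# parameters (gender, age, New Yorkness, Texasness, nasality).
W_voice = np.random.randn(256, 5)
b = np.zeros(256)
voice = np.array([0.2, 0.35, 0.0, 0.9, 0.1])
b_frozen = freeze_voice_parameters(W_voice, b, voice)
```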

Sets of voice properties constitute a voice vector. The speech configurator of FIG. 11 allows the user to select a button 1104 to save a vector of voice properties. The properties used to freeze the speech synthesis engine in the configuration process may be saved as plain text, XML, JSON, or another appropriate standard or proprietary format for representing parameter values. Likewise, the voice property values used to synthesize speech in the GUI of FIG. 10 can be saved similarly. Such a button is not shown in FIG. 10.

Voice Copying

Another possible service and method is to accept, through a user interface, a recording of speech by a person with a voice that has approximately the sound desired for a product identity. A system can process the recordings using a discriminator such as the discriminator 504 trained in the example of FIG. 5. The discriminator outputs a vector of probabilities that can be the values used to start the process of voice configuration or configured synthesis. A large amount of speech is usually best, or at least an amount satisfying a threshold amount of speech, but as few as several sentences may provide enough information for an acceptably accurate set of voice parameters for starting the experimentation needed for branding.

Avatars

Some end-user systems that provide configurable neural speech synthesis present a visual character to the user. Such a character may appear as an avatar, hologram, or other graphically generated display of a character that can speak. Users may interact with the system through typing, mouse-clicking, touch, gestures, or voice control. The user may configure the character that they see. The configuration may be done through a menu, keyboard commands, or voice commands. An example of a menu would look similar to that of FIG. 10 or FIG. 11 but without a download button. An example using voice commands would be for the user to speak a command such as, “Can you increase the Texasness by 20%?”. The system recognizes the speech as a natural language command to increase a Texasness voice parameter input to a neural speech synthesis engine. As described above in other examples, age, gender, accent, etc. are types of parameters that may be configurable by users in some systems. Users may perform such configurations by invoking a menu or by speaking directly to the animated character that corresponds to the synthesized voice being configured. This could be invoked with a voice command such as, “Hey Buddy, calm down and drop the New York accent.”

Brand Differentiation

A provider of voices may maintain a database. Also, or instead, an industry-standard body may maintain a database, or one or more national trademark offices may maintain a database. The database stores voice vectors that produce voices associated with brands. The database can be used to ensure that no two brands have the same voice or voices that are confusingly similar. However, it may be permissible for different brands to use similar voices as long as the brands are for different classes of goods and services.

FIG. 12 illustrates an example process 1200 for ensuring that brands have distinct voices. The method begins with a step 1202 of receiving a request to synthesize speech or generate a speech synthesis engine with a specific voice property vector. The method proceeds to a step 1204 of reading one or more stored vectors from a brand database 1214. In a next step 1206, the method computes a cosine distance between the requested voice property vector and the one or more voice property vectors read from the brand database 1214. The computation of the cosine distance may give different weights to different properties, since some properties of a voice have a greater influence on brand perception than others.
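
A weighted cosine distance of this kind may be sketched as follows, where w is a hypothetical vector of per-property weights:

    import numpy as np

    def weighted_cosine_distance(u, v, w):
        """Cosine distance with per-property weights w; heavier weights let
        brand-salient properties dominate the comparison."""
        uw, vw = u * np.sqrt(w), v * np.sqrt(w)
        similarity = np.dot(uw, vw) / (np.linalg.norm(uw) * np.linalg.norm(vw))
        return 1.0 - similarity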

If the smallest cosine distance from the requested voice property vector to the voice property vectors from the brand database 1214 is below a threshold distance, the method proceeds 1212 to provide an error message. It may then return to step 1202 to receive a new voice property vector. If the smallest cosine distance is above the threshold distance, the method may proceed to a step 1208 of storing the requested voice vector in the brand database 1214 so that it may be compared to future requested voice vectors. After storing the requested voice vector, the method may proceed to a further step 1210 of generating code for a speech synthesizer. Additionally or alternatively, the method may proceed to synthesize input text in the voice defined by the requested voice property vector. There may be other intermediate steps within implementations of the method of FIG. 12.
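
Continuing the sketch above, the threshold check and registration of steps 1206 through 1210 might look like:

    def check_and_register(requested, brand_db, weights, threshold):
        """Refuse a requested voice vector that is confusingly close to any
        stored vector; otherwise store it for future comparisons (step 1208)."""
        distances = [weighted_cosine_distance(requested, stored, weights)
                     for stored in brand_db]
        if distances and min(distances) < threshold:
            raise ValueError("voice is confusingly similar to a registered voice")
        brand_db.append(requested)
        return requested   # proceed to code generation or synthesis (step 1210)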

It is also possible to store in the brand database 1214 an allowable distance of exclusivity associated with each brand's voice property vector. In that case, the threshold for comparison is based on the exclusivity distance associated with each brand's voice property vector. Brand owners may pay to have a larger exclusivity distance, which gives them a more distinct voice.

The allowable distance may be dynamic. For example, the allowable distance may depend on the distance between goods and services within the same or different classes. For example, goods and services within the same or a similar class may be associated with stricter thresholds than goods and services in different classes.
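
One hypothetical way to combine a per-brand exclusivity radius with class-dependent tightening is sketched below; the field names and the scaling factor are illustrative, not prescribed:

    def effective_threshold(stored_entry, requested_classes, base_threshold):
        """Per-entry threshold: a paid exclusivity radius, tightened when the
        requested goods/services classes overlap the stored entry's classes."""
        threshold = max(base_threshold, stored_entry["exclusivity_radius"])
        if set(requested_classes) & set(stored_entry["classes"]):
            threshold *= 1.5   # same or similar class: require more separation
        return threshold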

Trademark Examination

It is in the public interest for consumers to be able to identify the source of goods and services. Most major countries of the world have legal systems to prevent passing off of goods and services. To support the enforcement of the uniqueness of identifiers of goods and services, such countries keep registries of trademarks. These can include names, descriptive words, logos, distinguishing colors, and sounds. As we enter an era of voice-enabled goods and services, where the voices are distinctive to brands, it is desirable to register voices as trademarks. Such voices can be defined with voice property vectors as described above. To ensure that an application for trademark registration is requesting an appropriately distinctive trademark, it is necessary to examine trademarks. However, a problem arises in that it is difficult for a human examiner to compare a voice specified in a trademark application with other existing voice trademarks.

FIG. 13 illustrates an example method 1300 for examining voice trademarks in accordance with various embodiments. First, a brand owner 1302 performs a step 1308 of synthesizing speech with a distinctive voice to produce an audio segment 1306 of speech audio. The brand owner provides the audio segment 1306 to a trademark office 1304. The trademark office 1304 receives the specimen of speech audio with an application for trademark registration. The trademark office 1304 may require a minimum length of speech to be able to distinguish its voice characteristics with sufficient accuracy for examination.

The trademark office 1304 performs a step 1310 of applying a discriminator to the audio segment 1306. A discriminator such as the one shown in FIG. 5 may be appropriate, as it outputs a voice property vector for a plurality of voice property values. The trademark office 1304 proceeds to a step 1312 of searching a database 1316 of registered voice property vectors. The search comprises computing distances between the computed voice property vector of the registration application and other voice property vectors stored in the database 1316. The search may be constrained to voice trademarks within a specified set of classes of goods and services.

If the voice property vector of the audio segment 1306 of the registration application is within a threshold distance of another voice property vector in the database 1316 for a claimed set of goods and services in a matching class, then the trademark registration is to be refused. Otherwise, it may be further examined for possible registration. The trademark office 1304 proceeds to prepare an office action 1314 for the brand owner 1302 indicating whether the trademark registration is refused because of similarity to other registered voice trademarks.
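
Steps 1310 through 1312 and the refusal decision may be sketched as follows, with discriminator again standing in for a trained model and with hypothetical registry field names:

    import numpy as np

    def examine_application(specimen_audio, claimed_classes, registry,
                            discriminator, threshold):
        """Steps 1310-1312: compute a voice property vector from the specimen,
        then search registered vectors within matching classes."""
        applied = discriminator(specimen_audio)        # step 1310
        conflicts = [entry for entry in registry       # step 1312
                     if set(entry["classes"]) & set(claimed_classes)
                     and np.linalg.norm(applied - entry["vector"]) < threshold]
        return ("refuse", conflicts) if conflicts else ("allow", [])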

FIG. 14A illustrates an example process 1400 for training a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio conditioned on a value of a voice property in accordance with various embodiments. It should be understood that there can be additional, fewer, or alternative steps performed in similar or alternative orders, or in parallel, within the scope of the various embodiments unless otherwise stated. In this example, source samples of speech audio (e.g., voice data from an individual such as a voice donor, or machine-generated voice data from a TTS system or other audio generation system) can be obtained 1402. A variety of different methodologies may be used to retrieve the source samples, including but not limited to data scrapes, API access, etc. The source samples can be labeled 1404 with discrete values of a voice property, such as a gender voice property, an age voice property, an accent voice property, or a timbre voice property. Other voice properties may indicate the attitude of the speaker, such as whether the speaker appears happy, sad, calm, excited, formal, or casual. A discriminator can be trained 1406 from the source samples and labels. The discriminator is configured to generate a probability value that quantifies the likelihood of the voice property from a sample of speech audio. A model (e.g., a neural speech synthesis model or synthesis model) can be trained by synthesizing 1408 a multiplicity of synthesized speech samples using the model with a diverse set of voice property values. In certain embodiments, the synthesizing uses a transcription of the source samples. In this example, a source-matching weight adjustment is computed by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech. Corresponding property probabilities can be computed 1410 for the synthesized speech samples using the discriminator. Thereafter, a property-learning weight adjustment can be computed 1412 by back-propagating changes to minimize a loss function that depends on differences between the voice property values and the corresponding probabilities.
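
The two weight adjustments of process 1400 can be illustrated with the following deliberately small TensorFlow sketch; the toy model shapes, the single-frame synthesizer, and the equal loss weighting are assumptions for illustration, not the claimed architecture:

    import tensorflow as tf

    FRAME_LEN = 64   # hypothetical audio frame length
    TEXT_DIM = 16    # hypothetical text-encoding size
    N_PROPS = 3      # e.g., age, gender, accent strength

    # Toy stand-ins: a real system would use a full neural TTS architecture
    # and a discriminator trained as in step 1406.
    synthesizer = tf.keras.Sequential([
        tf.keras.Input(shape=(TEXT_DIM + N_PROPS,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(FRAME_LEN),
    ])
    discriminator = tf.keras.Sequential([
        tf.keras.Input(shape=(FRAME_LEN,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(N_PROPS, activation="sigmoid"),
    ])
    discriminator.trainable = False   # frozen while training the synthesizer

    opt = tf.keras.optimizers.Adam(1e-4)
    mse = tf.keras.losses.MeanSquaredError()

    def train_step(text_enc, source_audio, prop_values):
        with tf.GradientTape() as tape:
            synth = synthesizer(tf.concat([text_enc, prop_values], axis=-1))
            source_loss = mse(source_audio, synth)               # source matching
            prop_loss = mse(prop_values, discriminator(synth))   # property learning
            loss = source_loss + prop_loss
        grads = tape.gradient(loss, synthesizer.trainable_variables)
        opt.apply_gradients(zip(grads, synthesizer.trainable_variables))
        return loss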

FIG. 14B illustrates an example process 1420 for generating synthesized speech using a trained model in accordance with various embodiments. In this example, a string of text and at least one voice property value can be received 1422 at a trained model (e.g., the neural speech synthesis model or synthesis model). The string of text and voice property value can be associated with a perceptible meaning. For example, the voice property values may define voice characteristics such as accents and attitudes. The string of text and voice property values can be received in accordance with embodiments described in FIG. 10. For example, a user may utilize a GUI that includes a text entry box operable for the user to enter text to synthesize. The GUI may further include slider bars or other graphical elements or input fields that can be used to define voice property values. In an embodiment, the string of text can be associated with at least one text tag. For example, the string of text can include tags in the SSML language to indicate words to be spoken with emphasis, allowing for dynamically configurable voice parameter values. Speech audio corresponding to the string of text can be synthesized 1424 using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio. Thereafter, the synthesized speech audio can be outputted 1426, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value. In certain embodiments, outputting the synthesized speech audio may allow for downloading and/or playback of the synthesized speech audio.
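
Continuing the toy models of the training sketch above, synthesis conditioned on slider values might be invoked as follows; the inputs are stand-ins for a text encoding and GUI slider values:

    import tensorflow as tf

    text_enc = tf.random.uniform((1, TEXT_DIM))   # stand-in text encoding
    props = tf.constant([[0.3, 0.9, 0.6]])        # e.g., age, gender, accent

    audio = synthesizer(tf.concat([text_enc, props], axis=-1))
    # A real system would vocode/stream the frames for playback or download.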

FIG. 14C illustrates an example process 1440 for configuring a speech synthesizer in accordance with various embodiments. In this example, at least one voice property value is received 1442. The voice property values in certain embodiments constitute a voice property vector. Code for execution by a computer is generated 1444. The code can be in a binary format. The code can be configured to implement a neural network wherein a node in a hidden layer includes, in its summation, a constant term derived from a product of the at least one voice property value and a weight learned from a training process. Thereafter, the code is outputted 1446. In an embodiment, the outputted code, when executed, is configured to implement a speech synthesis function within the speech synthesizer. For example, a user, such as a system engineer, can submit a request for synthesized speech, and the received synthesized speech can be incorporated into a product or used for another purpose. In another example, a function call can be received to ensure distinct voices. For example, a request to synthesize speech or generate a speech synthesis engine with a specific voice property vector is received. At least one stored voice property vector is read from a brand database. A distance between the at least one stored voice property vector and the requested voice property vector is computed. If the computed distance satisfies a threshold distance, an error message can be generated indicating that the voice closely resembles a stored voice and that it may be desirable to generate a different voice. If the computed distance fails to satisfy the threshold distance, indicating a distinct voice, the voice property vector can be stored in the brand database for comparison with future requested voice vectors.

CRMs

Some examples described above are best performed on servers such as ones in data centers. For example, training of neural networks and hosting of APIs for speech synthesis or synthesis engine generation tend to be performed on servers. The servers run software stored on non-transitory computer readable media.

FIG. 15A illustrates an example non-transitory computer readable medium 191 that is a rotating magnetic disk. Data centers commonly use magnetic disks to store data and code comprising instructions for server processors. The magnetic disk stores code comprising instructions that, if executed by one or more computers, would cause the computer to perform steps of methods described herein. Rotating optical disks and other mechanically moving storage media are possible.

Some implementations described above are best performed on personal computers such as laptops, mobile devices such as mobile phones and tablets, and embedded systems such as automobiles, robots, and appliances. For example, requesting configurable neural speech synthesis through an API, downloading and running speech synthesis engines, and running trademark examination software are best performed on such devices.

FIG. 15B illustrates an example non-transitory computer readable medium 193 that is a Flash random access memory (RAM) chip. Data centers commonly use Flash memory to store data and code for server processors. Personal computers, mobile devices, and embedded systems commonly use Flash memory to store data and code for processors within system-on-chip devices. The Flash device 193 stores code comprising instructions that, if executed by one or more computers, would cause the computer to perform steps of methods described herein. Other non-moving storage media packaged with leads or solder balls are possible.

Any type of computer-readable medium is appropriate for storing code comprising instructions according to various embodiments.

The Server

Servers, such as ones common in data centers, are often implemented as rack-mounted server blades. They have invisible fans behind cooling openings, blinking lights, and cable connections. FIG. 16A illustrates a rack-mounted server blade multi-processor server system 195. It comprises a multiplicity of network-connected computer processors that run software in parallel.

FIG. 16B illustrates a block diagram of the server system 151. It comprises a multicore cluster of computer processor (CPU) cores 152 and a multicore cluster of graphics processor (GPU) cores 153. The processors connect through a board-level interconnect 154 to random-access memory (RAM) devices 155 for program code and data storage. Server system 151 also comprises a network interface 156 to allow the processors to access network-attached storage devices comprising non-transitory computer readable media and the Internet. By executing instructions stored in RAM devices 155, the multicore cluster of CPU cores 152 and the GPU cores 153 perform steps of methods as described herein.

Some embodiments function by running software on general-purpose programmable processors (CPUs) such as ones with ARM or x86 architectures. Some power-sensitive embodiments, and some embodiments that require especially high performance such as for neural network algorithms, use hardware optimizations. Some embodiments use application-customizable processors with configurable instruction sets in specialized systems-on-chip, such as ARC processors from Synopsys and Xtensa processors from Cadence. Some embodiments use dedicated hardware blocks burned into field programmable gate arrays (FPGAs). Some embodiments use arrays of graphics processing units (GPUs). Some embodiments use application-specific integrated circuits (ASICs) with customized logic to give the best performance. Some embodiments are in hardware description language code such as code written in the language Verilog.

Some embodiments of physical machines described and claimed herein are programmable in numerous variables, combinations of which provide essentially an infinite variety of operating behaviors. Some embodiments herein are configured by software tools that provide numerous parameters, combinations of which provide for essentially an infinite variety of embodiments of the invention described and claimed.

Hardware blocks, custom processor instructions, co-processors, and hardware accelerators perform neural network processing or parts of neural network processing algorithms with particularly high performance and power efficiency. This provides long battery life for battery-powered devices and reduces heat removal costs in data centers that serve many client devices simultaneously.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some embodiments may be described using the expressions “coupled” and “connected” along with their derivatives. For example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other but still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variation thereof are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, the words “a” or “an” are employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for creating an interactive message through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various apparent modifications, changes, and variations may be made in the arrangement, operation, and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

The skilled person will be aware of a range of possible modifications of the various embodiments described above. Accordingly, the present invention is defined by the claims and their equivalents.

What is claimed is:

1. A computerized process of training a neural speech synthesis model that can generate speech audio conditioned on a value of a voice property, the computerized process comprising: obtaining source samples of speech audio; labeling the source samples with discrete values of a voice property; training, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and training the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples, computing corresponding probabilities for the synthesized speech samples using the discriminator, and computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.

2. A speech synthesis model obtained by the computerized process of claim 1.

3. The speech synthesis model of claim 2, wherein the speech synthesis model is configured to: receive a string of text and at least one voice property value with a perceptible meaning; synthesize speech audio corresponding to the string of text using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio; and output the synthesized speech audio, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value.

4. The speech synthesis model of claim 3, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.

5. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to: enable download of the synthesized speech audio.

6. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to: enable playback of the synthesized speech audio.

7. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field.

8. The speech synthesis model of claim 3, wherein the string of text is associated with at least one text tag.

9. The speech synthesis model of claim 3, wherein the string of text indicates dynamically configurable voice parameter values.

10. The computerized process of claim 1, wherein the synthesizing uses a transcription of source samples, the computerized process further comprising: computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.

11. A speech synthesis model obtained by the computerized process of claim 10.

12. The computerized process of claim 1, wherein the source samples of the speech audio are obtained from one of a person and an audio generation system.

13. The computerized process of claim 1, wherein the voice property includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.

14. A computer system for training a neural speech synthesis model to generate speech audio conditioned on a value of a voice property, comprising: at least one processor; and memory including instructions that, when executed by the at least one processor, cause the computer system to: obtain source samples of speech audio; label the source samples with discrete values of a voice property; train, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and train the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples, computing corresponding probabilities for the synthesized speech samples using the discriminator, and computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.

15. The computer system of claim 14, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.

16. The computer system of claim 14, wherein the neural speech synthesis model is further configured to: enable download of the synthesized speech audio.

17. The computer system of claim 14, wherein the neural speech synthesis model is further configured to: enable playback of the synthesized speech audio.

18. The computer system of claim 14, wherein the neural speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field.

19. The computer system of claim 14, wherein the string of text is associated with at least one text tag.

20. The computer system of claim 14, wherein the system uses a transcription of source samples, and wherein the instructions when executed further cause the computer system to: compute a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.